Part 1. Probability Theory (part b)
Recall a random generative process produces an outcome \(\omega \in \Omega\); events (e.g. \(A\), \(B\)) are sets of outcomes.
A random variable \(X\) is a function that maps each outcome \(\omega\) to a real number. “Random variables are real-valued functions of outcomes” (A&M p. 38)
e.g. \(X(\omega) = 1\) means that \(X\) assigns the number 1 to the outcome \(\omega\); the event \(\{\omega : X(\omega) = 1\}\) collects all such outcomes.
Usually we just write \(X = 1\) and \(\text{Pr}[X = 1]\). (cf \(P(A)\) for events)
Can also think of \(X\) as a random process that produces numbers as outcomes, but A&M’s way distinguishes the random process itself from the researcher’s representation of it in numbers.
We can define two (or more) random variables on the same sample space.
If the events of interest can be quantified, then a statement like \(\text{Pr}[X = 1]\) is much easier to work with than a probability statement about a raw set of outcomes.
Blitzstein and Hwang: “Random variables provide numerical summaries of the experiment in question.”
Since a random variable \(X\) produces numbers, we can apply functions: e.g. \(X^2\), \(\sqrt{X}\), generically \(g(X)\)
This gives us a new number for each \(\omega\) – a new random variable.
We also want to describe a random variable using operators: e.g. \(E[X]\) (expectation), \(V[X]\) (variance).
This gives us a number to describe \(X\) – not a new random variable. (We may estimate these from samples, producing RVs, e.g. sample mean, sample variance.)
A random variable \(X\) is discrete if its range \(X(\Omega)\) is a countable set. For example, \(\{1,2,3\}\), \(\{1,2,3, \ldots\}.\)
A discrete RV has a probability mass function (PMF): \[f(x) = \text{Pr}[X = x], \forall x \in \mathbb{R}.\]
For example, the number of heads in two flips of a fair coin:
\[ f(x) = \begin{cases} 1/4 & x = 0 \\ 1/2 & x = 1 \\ 1/4 & x = 2 \\ 0 & \text{otherwise} \end{cases} \]
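This PMF can be checked by brute-force enumeration of the equally likely outcomes; a minimal Python sketch:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Enumerate all equally likely outcomes of two fair coin flips
outcomes = list(product(["H", "T"], repeat=2))

# X(omega) = number of heads in the outcome omega
counts = Counter(omega.count("H") for omega in outcomes)

# PMF: f(x) = Pr[X = x], as exact fractions
pmf = {x: Fraction(n, len(outcomes)) for x, n in counts.items()}
# pmf maps 0 -> 1/4, 1 -> 1/2, 2 -> 1/4
```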
The cumulative distribution function (CDF) of a random variable \(X\) is
\[ F(x) = \text{Pr}[X \leq x], \forall x \in \mathbb{R}\] The CDF is another way to fully describe a random variable.
For the coin flip example,
\[ F(x) = \begin{cases} 0 & x < 0 \\ 1/4 & 0 \le x < 1 \\ 3/4 & 1 \le x < 2 \\ 1 & x \ge 2 \end{cases} \]
If a random variable can take on a continuum of values (i.e. \(X(\Omega)\) includes an interval of the real line), then we say it is continuous.
Probability density function (PDF) written \(f(x)\), CDF written \(F(x)\).
CDF is integral of PDF below \(x\):
\[F(x) = \text{Pr}[X \leq x] = \int_{-\infty}^x f(u) du\]
\[\text{Pr}[a \leq X \leq b] = \int_{a}^b f(u) du = F(b) - F(a)\]
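A quick numerical check of the identity \(\Pr[a \leq X \leq b] = F(b) - F(a)\), using the standard normal as an example (its CDF can be written with the error function):

```python
import math

# Standard normal PDF
def f(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

# Standard normal CDF, via the error function
def F(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Pr[a <= X <= b] two ways: midpoint-rule integral of the PDF, and F(b) - F(a)
a, b, n = -1.0, 1.0, 100_000
dx = (b - a) / n
integral = sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

# Both are approximately 0.6827 (the familiar "one sigma" probability)
```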
We might have two random variables – e.g. flip a coin and draw a ball from an urn.
Or, in a data analysis problem we could view two characteristics of the same unit as random variables, e.g. a survey respondent's age and income.
We can describe two random variables \(X\) and \(Y\) with
\[ f(x,y) = \textrm{P}[X=x, Y=y], \forall x, y \in \mathbb{R} \]
\[ F(x,y) = \textrm{P}[X \leq x, Y \leq y], \forall x, y \in \mathbb{R} \]
Let \(X\) denote number of heads in two tosses of a fair coin.
Let \(Y\) denote number of heads in one toss of a fair coin.
| x | y | Pr[X = x, Y = y] |
|---|---|---|
| 0 | 0 | 1/8 |
| 0 | 1 | 1/8 |
| 1 | 0 | 1/4 |
| 1 | 1 | 1/4 |
| 2 | 0 | 1/8 |
| 2 | 1 | 1/8 |
The same joint PMF can also be written as a piecewise function:
\[ f(x, y) = \begin{cases} 1/8 & x = 0, y = 0 \\ 1/8 & x = 0, y = 1 \\ 1/4 & x = 1, y = 0 \\ 1/4 & x = 1, y = 1 \\ 1/8 & x = 2, y = 0 \\ 1/8 & x = 2, y = 1 \\ 0 & \text{otherwise} \end{cases} \]
Or, most compactly, as an \(X\)-by-\(Y\) table of \(\text{Pr}[X = x, Y = y]\):
|  | y = 0 | y = 1 |
|---|---|---|
| x = 0 | 1/8 | 1/8 |
| x = 1 | 1/4 | 1/4 |
| x = 2 | 1/8 | 1/8 |
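The table can be reproduced by enumeration. This sketch assumes \(Y\) comes from a third toss, independent of the two tosses that define \(X\) (which is consistent with the probabilities shown):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Three independent fair coin flips: the first two determine X, the third Y
outcomes = list(product(["H", "T"], repeat=3))

joint = Counter()
for omega in outcomes:
    x = omega[:2].count("H")   # X: heads in the two tosses
    y = omega[2:].count("H")   # Y: heads in the separate single toss
    joint[(x, y)] += 1

pmf = {xy: Fraction(n, len(outcomes)) for xy, n in joint.items()}
assert pmf[(1, 0)] == Fraction(1, 4)   # matches the table
assert pmf[(2, 1)] == Fraction(1, 8)
```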
Recall: A joint PMF \(f(x, y) = \text{Pr}[X = x, Y = y]\) describes the distribution of two discrete RVs \(X\) and \(Y\).
We can also talk about the marginal PMF of one of the variables:
\[f_Y(y) = \text{Pr}[Y = y] = \sum_{x \in \text{Supp}[X]} f(x, y), \forall y \in \mathbb{R}.\]
Basically, this describes the distribution of \(Y\) ignoring \(X\).
This is an application of the Law of Total Probability.
With the \(X\)-by-\(Y\) representation of joint PMF, the marginal PMF of \(X\) is the row sums, marginal PMF of \(Y\) is column sums (written in the margins):
|  | y = 0 | y = 1 | \(f_X(x)\) |
|---|---|---|---|
| x = 0 | 1/8 | 1/8 | 1/4 |
| x = 1 | 1/4 | 1/4 | 1/2 |
| x = 2 | 1/8 | 1/8 | 1/4 |
| \(f_Y(y)\) | 1/2 | 1/2 |  |
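The row-sum/column-sum recipe, sketched in Python with the joint table hard-coded from above:

```python
from fractions import Fraction

F = Fraction
# Joint PMF from the table, indexed as joint[x][y]
joint = {
    0: {0: F(1, 8), 1: F(1, 8)},
    1: {0: F(1, 4), 1: F(1, 4)},
    2: {0: F(1, 8), 1: F(1, 8)},
}

# Marginal of X: sum each row over y
f_X = {x: sum(row.values()) for x, row in joint.items()}

# Marginal of Y: sum each column over x
f_Y = {y: sum(joint[x][y] for x in joint) for y in (0, 1)}

# f_X gives 1/4, 1/2, 1/4 and f_Y gives 1/2, 1/2, as in the margins above
```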
With the graphical representation, think about sweeping the mass to the axis:
For RVs \(X\) and \(Y\), we can also talk about the conditional PMF of \(Y\) at a value of \(X\):
\[f_{Y|X}(y|x) = \text{Pr}[Y = y \mid X = x] = \frac{\text{Pr}[X = x, Y = y]}{\text{Pr}[X = x]} = \frac{f(x, y)}{f_X(x)}\] \(\forall y \in \mathbb{R}\) and \(\forall x \in \text{Supp}[X]\).
With the \(X\)-by-\(Y\) representation of joint PMF, you get the conditional PMF of \(Y\) given \(X = x\) by dividing each row by the row sum (i.e. the marginal probability of \(X = x\)):
\(f(x, y)\)
|  | y = 0 | y = 1 |
|---|---|---|
| x = 0 | 1/8 | 1/8 |
| x = 1 | 1/4 | 1/4 |
| x = 2 | 1/8 | 1/8 |
\(f_{Y|X}(y |x )\)
|  | y = 0 | y = 1 |
|---|---|---|
| x = 0 | 1/2 | 1/2 |
| x = 1 | 1/2 | 1/2 |
| x = 2 | 1/2 | 1/2 |
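The divide-each-row-by-its-row-sum recipe, as a minimal sketch:

```python
from fractions import Fraction

F = Fraction
# Joint PMF from the table, indexed as joint[x][y]
joint = {
    0: {0: F(1, 8), 1: F(1, 8)},
    1: {0: F(1, 4), 1: F(1, 4)},
    2: {0: F(1, 8), 1: F(1, 8)},
}

# Conditional PMF of Y given X = x: divide each row by its row sum f_X(x)
cond = {}
for x, row in joint.items():
    f_x = sum(row.values())          # marginal Pr[X = x]
    cond[x] = {y: p / f_x for y, p in row.items()}

# Every row comes out as {0: 1/2, 1: 1/2}: Y's distribution doesn't depend on x
assert all(cond[x][y] == F(1, 2) for x in cond for y in (0, 1))
```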
With the graphical representation, think about taking a slice of the PMF and rescaling it:
Two continuous random variables are described by a joint probability density function (PDF) \(f(x, y)\).
The marginal PDF of \(Y\) is obtained by integrating out \(x\):
\[f_Y(y) = \int_{-\infty}^{\infty}f(x,y) dx, \forall y \in \mathbb{R}\]
To get \(f_Y(y)\) for a specific \(y\), slice the joint pdf at \(Y = y\), and integrate (sum) \(f(x, y)\) across all values of \(x\).
To get \(f_Y(y)\) for all \(y\)s, think about squishing the PDF into the \(y\)-axis.
\[f_{Y \mid X}(y \mid x) = \frac{f(x,y)}{f_X(x)}, \forall y \in \mathbb{R} \, \text{and} \, \forall x \in \mathrm{Supp}[X]\]
For a specific \(x\), what does \(f(x,y)\) look like? What does \(f_X(x)\) look like?
\(f(x, y)\) for \(x=1\) is the intersection between the joint PDF and the plane \(x = 1\):
And \(f_X(1)\) is the integral of that slice over \(y\): numerically I compute it to be about 0.24.
Above is \(f_{Y \mid X} (y \mid x)\) for one value of \(x\). Here it is for all \(x\):
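The slice-and-rescale operation can be sketched numerically. The joint density below is a made-up example, \(f(x,y) = x + y\) on the unit square, not the density in the figures:

```python
# A hypothetical joint density (NOT the one pictured in this document):
# f(x, y) = x + y on the unit square, zero elsewhere.
def f_joint(x, y):
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def integrate(g, a, b, n=10_000):
    """Midpoint-rule numerical integral of g over [a, b]."""
    dx = (b - a) / n
    return sum(g(a + (i + 0.5) * dx) for i in range(n)) * dx

x0 = 0.5
# Slice the joint density at X = x0, then integrate the slice to get f_X(x0)
f_x0 = integrate(lambda y: f_joint(x0, y), 0, 1)   # analytically x0 + 1/2 = 1.0

# Conditional density: rescale the slice so it integrates to 1
def f_cond(y):
    return f_joint(x0, y) / f_x0

assert abs(integrate(f_cond, 0, 1) - 1.0) < 1e-9
```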
Some questions are best answered with the joint distribution \(f(x,y)\); others are best answered with the conditional distribution \(f_{Y|X}(y | x)\).
It can be very important to distinguish between the two!
Researchers often get confused about which conditional probability they are handling, \(\Pr[Y \mid X]\) or \(\Pr[X \mid Y]\). The two are connected by Bayes' rule:
\[f_{X|Y}(x \mid y) = \frac{f_{Y|X}(y \mid x)\, f_X(x)}{f_Y(y)}\]
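A small numerical illustration of why the two conditionals differ, using a made-up dependent joint PMF (the coin example won't do here, since there the two RVs turn out to be independent):

```python
from fractions import Fraction

F = Fraction
# A hypothetical DEPENDENT joint PMF over (x, y)
joint = {(0, 0): F(3, 10), (0, 1): F(1, 10),
         (1, 0): F(2, 10), (1, 1): F(4, 10)}

f_X = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
f_Y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# The two conditionals are different objects...
p_y1_given_x1 = joint[(1, 1)] / f_X[1]   # Pr[Y=1 | X=1] = (4/10)/(6/10) = 2/3
p_x1_given_y1 = joint[(1, 1)] / f_Y[1]   # Pr[X=1 | Y=1] = (4/10)/(5/10) = 4/5

# ...but Bayes' rule connects them: Pr[X|Y] = Pr[Y|X] Pr[X] / Pr[Y]
assert p_x1_given_y1 == p_y1_given_x1 * f_X[1] / f_Y[1]
```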
Random variables \(X\) and \(Y\) are independent if, \(\forall x, y \in \mathbb{R}\):
\[f(x, y) = f_X(x) f_Y(y)\]
Equivalently,
\(X\) and \(Y\) are independent if the conditional distribution of \(X\) at every \(y\) is the same as the marginal distribution of \(X\), i.e. if \(f_{X | Y}(x | y) = f_X(x)\) \(\forall x \in \mathbb{R}\) and \(\forall y \in \text{Supp}[Y]\)
And vice versa.
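The product criterion can be checked directly on the coin example's joint PMF:

```python
from fractions import Fraction

F = Fraction
# Joint PMF of the coin example (X: heads in two tosses, Y: heads in one toss)
joint = {(0, 0): F(1, 8), (0, 1): F(1, 8),
         (1, 0): F(1, 4), (1, 1): F(1, 4),
         (2, 0): F(1, 8), (2, 1): F(1, 8)}

f_X = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1, 2)}
f_Y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# Independence: f(x, y) = f_X(x) f_Y(y) at every (x, y)
independent = all(joint[(x, y)] == f_X[x] * f_Y[y] for (x, y) in joint)
assert independent  # the two-toss count and the separate toss are independent
```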